Introduction

We intent to find out what makes a start-up attractive for acquisition, and what drives its acquisition price up or down.

In this study, we plan to analyze the acquisition data of more than 2000 companies from all over the world, which are founded from year 1900 to year 2014. Particulary, we have interests in exploring what factors impact the acquisition price of a company, and trying to predict the company’s acquisition price using regression. The data set has 2473 rows of data.

We consider the following variable as the response variable:

variable name description data scoure data type variable type # of missing records
aquisition_price_amount Amount paid for aquisition acquisitions.csv int numerical 0

We consider the following 16 variables as the potential predictor:

variable name description data scoure data type variable type # of missing records
category_code Entity category objects.csv string categorical 475
normalized_name Normalized entity name objects.csv string categorical 0
logo_width Logo width objects.csv float numerical 0
logo_height Logo height objects.csv float numerical 0
description Description of the entity objects.csv string categorical 1380
country_code Country code objects.csv string categorical 514
state_code State code objects.csv string categorical 980
city City name objects.csv string categorical 580
region Region name objects.csv string categorical 1110
investment_rounds Number of investment round participated in objects.csv int numerical 0
investment_companies Number of companies invested in objects.csv int numerical 0
milestones Number of milestones the entity has objects.csv int numerical 0
relationships Number of relationships the entity has objects.csv int numerical 0
istop25 Is the founder from top25 university founders.csv boolean 0
istop26_50 Is the founder from top26 - 50 university founders.csv boolean 0

Background information of the data set

The database comes from a Kaggle public dataset about startup investments published by Justinas Cirtautas at the end of 2019. The dataset covers more 450 thousand company information from all over the world till 2013. Among those, about 10 thousand companies acomplished their acquisition successfully, while 80% of them got 0 dollars in their acquisition.

The entire dataset provides 11 cvs file with 154 columns, which covers six aspects of the startup ecosystem including organizations, individuals, company news, funding rounds, acquisitions, and IPOs. We populated all the 11 csv files. More detials can be found in the appendix at the end of this proposal. Considering the missing data rate, we determine to focus on only two files: acquisition.csv and objects.csv.

  • acquisition.csv provides the acquisition details, including the acquisition amount, currency and the deal date.
  • objects.csv provides the basic company profiles, which can be used as the predictor of our study.
  • founders.csv provides the basic founders profiles, which can be used as the predictor of our study.

Workflow of the Project

Venture capital investments hits 9.5 billion U.S. dollars in the internet industry in the United States, as of first quarter 2020. Other leading VC sectors in terms of investment were healthcare, and software. Inspired by the florish startup world, we want to investigate what factors can impact the company price during acquisition.

In this study, we select the price of the company for acquisition as our ‘Y’, the response variable. Our target is to find a set of predictors and use regression method to predict our Y. We checked all dataset and select 16 varaibles as our potential predictors.

First, we did the data exploration on the raw data file. We checked the relationship between every single variable and the response and the correlation matrix. Details can be found in the section of Methods - Data exploration. To overcome the issue that the catogorical data has too many levels, we performed data wrangling to only keep the major levels in the category. Details can be found in the section of Methods - Data Adjustment.

Second, we considering the three classes of model: additive model, interaction model, and interation model with boxcox transfermation. The model complexities are increasing among those three classes. Within each class, we use the backwards selection and forwards selection with AIC/BIC to do the model selection. Details can be found in section of Methods - Model building. To eccelrate the model comparison process, we defined several functions to make it possible to output multiple evaluation metrics by only one line of code. Details can be found in Methods - Diagnostics and Performance Function.

In the end, We end up with two final regression models. One is the best model we got when considering all the interation effect. The other one is the best model we got when considering the interaction plus the boxcox transfermation. Both models include more than 600 predictors (so we cannot print out our final model explicitly). Even though we start from only 16 variables, but because some of the categorical variable has more than 20 levels. After adding all the interactions, we have a huge amount of parameters in our final regression model. In this way, this model is more suitable for prediction rather than explanation. Both model achevie the R-square of more than 0.6. A comprehensive comparison on our two final models can be found in the section of Results and Discussion.

Takeaways from Modeling Procedure

  • It is important to keep the number of levels in the categorical variables in a moderate amount.

  • For a model with more 600 coefficients, the effect which is significant in the univariate analysis can become unsignificant in the full regression, since it can be masked by the effect of other variables.

  • When we use the boxcox transfermation, there is a chance that the regression produce some negative outcomes and it cannot be transfered back. This can be deemed as a limitation of the boxcox plot.

Methods

Data Exploration

In this section, we focus on the exploratin of all the predictor data. We first performed an univariate analysis and then followed by the correlation analysis.

Univarite Analysis

# read the dataset
data0 = read.csv("./startups.csv", header = TRUE)
#data0 = read.csv('startups1.csv', header = TRUE)

First, we checked the distribution of aquisition price.

hist(data0$aquisition_price_amount, breaks = 100)

As we can see from the plot above, the distrbution is highly right skewed. Then we use the box plot to check if there is any outliers.

boxplot(data0$aquisition_price_amount)

As we can see there is one record in the dataset with extremely high acquicition price. We checked the database further and found its acqusition price is 2600 billion, which equals the valuation of google and amazon. So we think this one should be an database error, we will exclude this database in our further analysis.

data1 = data0[data0$aquisition_price_amount < 10^12,]
boxplot(data1$aquisition_price_amount)

After excluding the error in our database, the distribution of aquisition price is still highly right skewed. We will consider the log transformation on the aquisition price.

data1["logprice"] = apply(data1["aquisition_price_amount"],2,log)
hist(data1$logprice, breaks = 100)

boxplot(data1$logprice)

The distributino of log aquisition price is relatively sysmetric. From the boxplot we can still find there are many records is outside of the 1.75 IQR. But we think we can start from here to build our model.

Then we checked the precdictor in the database one by one. For category_code, this is a categorical varaible.

summary(data1$category_code)
##      advertising        analytics       automotive          biotech 
##               84                3                3              263 
##        cleantech       consulting        ecommerce        education 
##               39               32               50                6 
##       enterprise          fashion          finance      games_video 
##              115                3               20               88 
##         hardware           health      hospitality            local 
##               78               10                7                1 
##    manufacturing          medical        messaging           mobile 
##               21                4                5              117 
##            music         nanotech  network_hosting             news 
##                3                1               64                9 
##            other      photo_video public_relations      real_estate 
##               84                5               74                6 
##           search         security    semiconductor           social 
##               19               36               71                4 
##         software           sports   transportation           travel 
##              403                2                3                4 
##          unknown              web 
##              475              260
plot(data1$category_code, data1$logprice)

The records concentrated in several major levels of category_code. As for the boxplot between category_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

Then we turn to the name of the company. We extract the length of the company name as a feature.

data1["name_length"] = apply(data1["normalized_name"],2,nchar)
plot(data1$name_length, data1$logprice)

After check the relationship between length of the company name, we found if the company name has more than 30 characters, the acquisition price tend to decrease. so we plan to create a dummy variable, company name lenght > 30 or not, and add it to our candidate predictor list.

Then we check the logo width.

plot(data1$logo_width, data1$logprice)
fit = lm(logprice ~ logo_width, data1)
abline(fit, col = "red")

For the plot we found in fact, quite amount of the logo_width is 0. Then we guess, maybe in this column, 0 means missing value. so we decide to remove that part and replot the above figure.

sum(data1$logo_width == 0)
## [1] 502
temp = data1[data1$logo_width > 0, ]
plot(temp$logo_width, temp$logprice)
fit = lm(logprice ~ logo_width, temp)
abline(fit, col = "red")

It turns out there are 502 records in total has 0 logo_width. After removing them, the relationship between logo_width and log price become not that significant. considering the maximum logo_width is more than 5000 times of the minimum width, the data reliability becomes skeptical.

We further checked the log_height.

plot(data1$logo_height, data1$logprice)
fit = lm(logprice ~ logo_height, data1)
abline(fit, col = "red")

sum(data1$logo_height == 0)
## [1] 502
temp = data1[data1$logo_height > 0, ]
plot(temp$logo_height, temp$logprice)
fit = lm(logprice ~ logo_height, temp)
abline(fit, col = "red")

It show the same pattern. Out of curiosity, we create a variable named, logo_ratio, which is the ration of logo height and width.

mask = data1$logo_height > 0
data1["logo_ratio"] = data1["logo_height"] / data1["logo_width"]
plot(data1$logo_ratio[mask], data1$logprice[mask])
fit = lm(logprice ~ logo_ratio, data1[mask,])
abline(fit, col = "red")

But it turns out seems this variable does not help much. Considering the data sanity issue in those columns, we did not put them to our candidate predictor list.

Then we checked the country_code, which is a categorical variable.

summary(data1$country_code)
##     ANT     ARE     ARG     AUS     AUT     BEL     BMU     BRA     CAN     CHE 
##       1       1       2      23       3       3       1       4      53       8 
##     CHL     CHN     CYM     CZE     DEU     DNK     ESP     FIN     FRA     GBR 
##       1      23       1       1      23       5       5       4      20     100 
##     GHA     IDN     IND     IRL     ISR     ITA     JOR     JPN     KEN     KOR 
##       1       1      10       8      34       6       1      11       1       7 
##     LBN     LUX     MAR     MDA     MEX     MMR     MYS     NCL     NLD     NOR 
##       1       2       1       1       3       1       1       1      13       4 
##     NZL     PAK     PRT     RUS     SAU     SGP     SWE     TUN     TUR     TWN 
##       1       2       1       4       1       4      14       1       1       1 
## unknown     USA     VGB     ZAF 
##     514    1532       1       5
plot(data1$country_code, data1$logprice)

The records concentrated in several major levels of country_code. To be more specific, most records is in USA. As for the boxplot between country_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

The same observation found in city.

summary(data1$city)
##             unknown       San Francisco            New York       Mountain View 
##                 580                  99                  84                  44 
##         Santa Clara              London           Sunnyvale           Palo Alto 
##                  44                  42                  36                  31 
##            San Jose             Seattle           Cambridge           San Diego 
##                  29                  29                  28                  27 
##              Austin             Chicago         Los Angeles        Redwood City 
##                  24                  22                  21                  20 
##              Boston              Irvine             Waltham           San Mateo 
##                  19                  17                  16                  15 
##             Atlanta            Bellevue              Dublin             Houston 
##                  14                  13                  12                  12 
##           Cupertino              Dallas             Fremont             Boulder 
##                  11                  11                  11                  10 
##              Denver         Marlborough          Menlo Park            Tel Aviv 
##                  10                   9                   9                   9 
##               Paris             Raleigh        Santa Monica               Tokyo 
##                   8                   8                   8                   8 
##             Bedford              Durham          Pittsburgh              Reston 
##                   7                   7                   7                   7 
## South San Francisco           Amsterdam           Arlington            Brisbane 
##                   7                   6                   6                   6 
##            Carlsbad           Charlotte           Englewood             Hayward 
##                   6                   6                   6                   6 
##             Herndon            Milpitas        Philadelphia             Phoenix 
##                   6                   6                   6                   6 
##            Portland               Seoul             Toronto              Vienna 
##                   6                   6                   6                   6 
##         Aliso Viejo           Baltimore            Columbia              Lowell 
##                   5                   5                   5                   5 
##         Morrisville       Overland Park          Pleasanton          Scottsdale 
##                   5                   5                   5                   5 
##           Stockholm          Alpharetta       American Fork             Beijing 
##                   5                   4                   4                   4 
##             Bothell             Bristol            Brooklyn            Campbell 
##                   4                   4                   4                   4 
##        Corte Madera          El Segundo          Emeryville        Gaithersburg 
##                   4                   4                   4                   4 
##        Indianapolis              Irving             Malvern              McLean 
##                   4                   4                   4                   4 
##           Melbourne           Milwaukee         Minneapolis              Moscow 
##                   4                   4                   4                   4 
##       New York City              Newton              Oxford               Plano 
##                   4                   4                   4                   4 
##           Princeton          Richardson         San Antonio           San Bruno 
##                   4                   4                   4                   4 
##          San Carlos              Sydney            Tel-Aviv         Toronto, ON 
##                   4                   4                   4                   4 
##    Westlake Village              Woburn           Ann Arbor             (Other) 
##                   4                   4                   3                 777
plot(data1$city, data1$logprice)

The records concentrated in several major levels of city. As for the boxplot between city_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

For investiment_rounds, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$investment_rounds, data1$logprice)

For investiment_companies, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$invested_companies, data1$logprice)

For milestones, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$milestones, data1$logprice)

For relationships, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$relationships, data1$logprice)
fit = lm(logprice ~ relationships, data1)
abline(fit, col = "red")

For relationships, we found it showed a company whose founder has a degree of top25 university tend to have higher acquisition price. We added it to our candidate predictor list.

boxplot(data1$istop25, data1$logprice)

For top25, we found it showed a company whose founder has a degree of top 25 university tend to have higher acquisition price. We added it to our candidate predictor list.

boxplot(data1$istop26_50, data1$logprice)

For top26_50, we found it showed a company whose founder has a degree of top 26 to top 50 university tend to have higher acquisition price. We added it to our candidate predictor list.

Correlation Analysis

Then we checked the correlation matrix of the numerical variables.

data2 = data1[,c("name_length", "investment_rounds", "invested_companies", "milestones", "relationships")]
cor(data2)
##                    name_length investment_rounds invested_companies milestones
## name_length         1.00000000       -0.04786664        -0.04856472 -0.2870902
## investment_rounds  -0.04786664        1.00000000         0.98970339  0.1519085
## invested_companies -0.04856472        0.98970339         1.00000000  0.1576869
## milestones         -0.28709018        0.15190852         0.15768691  1.0000000
## relationships      -0.14405538        0.39209720         0.40107629  0.4476997
##                    relationships
## name_length           -0.1440554
## investment_rounds      0.3920972
## invested_companies     0.4010763
## milestones             0.4476997
## relationships          1.0000000

It turns out the investment_rounds and invested_companies shows strongly collinearity. We should only use one of them in the model to avoid the collinearity issue.

Data Adjustments

Based on the single predictor analysis above, it is now possible to further simplify the dataset that will be used to determine whether the aquisition_price of startups can be correlated with startup characteristics and their performance.

Some data adjustments include:

  • Collapsing of the number of levels for factor variables with a high number levels: category_code, country_code, city, state_code. This will ommit potential dummy variables when dealing with factor variables with levels that provide little significance. The fewer levels and variables also adds benefit of simplifying the model.
  • Removal of columns that are not useful in the statitical analysis: company_id,normalized_name, status, logo_width, logo_height. These are pure descriptions of the startups, and while these could have been a source for additional continuous (e.g. word counts) or discreet variables (mention of certain trendy keywords), we decided to leave out for now to narrow down the analysis on the other variables.
  • Division of the full data into 2 datasets: one for training (startups_tr), with 80% of the observations, and the rest will be included in a test dataset (startups_te), useful for model cross-validation. The split is done randomly.
  • Removal of outlier case of acquisition for 2.6T. For reference, Amazon’s Market Cap is 1.6T. We believe this was an input error.
#Loading of the data
library(readr)
startups = read_csv("./startups.csv")
attr(startups, "spec") = NULL
n = nrow(startups)

#Function that reduces the number of levels of a factor, by replacing some levels with the new.level based on the number of records a level has. 
#If threshold < 1, then it keeps the largest levels by counts up to when a the threshold percentile is reached
#If threshold >= 1, then it keeps the levels that have counts greater or equal to the threshold.
shorten_levels = function(col, threshold, new.level){
  new_col = as.character(col) 
  levels_to_collapse = 0
  level_counts = summary(col)
  counts = aggregate(new_col, by = list(col = new_col), FUN = length)

  if(threshold < 1){
    counts = counts[order(counts$x, decreasing = TRUE), ]
    counts$cumpct = cumsum(counts$x)/sum(counts$x)
    levels_to_collapse = counts[counts$cumpct >= threshold, "col"]
  }else{
    #levels_to_collapse = names(which(level_counts > threshold))
    levels_to_collapse = counts[counts$x < threshold, "col"]
  }
  
  replace = new_col %in% levels_to_collapse
  new_col[replace] = new.level
  as.factor(new_col)
}

#Modified function from https://www.mjdenny.com/Text_Processing_In_R.html to "normalize text"
clean_string = function(string){
    # Lowercase
    temp =  tolower(string)
    # Remove everything that is not a number or letter (
    temp = stringr::str_replace_all(temp,"[^a-zA-Z\\s]", " ")
    # Shrink down to just one white space
    temp = stringr::str_replace_all(temp,"[\\s]+", " ")
    temp
}

#Collapse the number of levels for all factor predictors.
startups$category_code = shorten_levels(startups$category_code, .90, "other")
startups$country_code = shorten_levels(startups$country_code, 20, "other")
startups$city = shorten_levels(clean_string(as.character(startups$city)),20, "other")
startups$state_code = shorten_levels(as.character(startups$state_code),.90, "other")
summary(startups$state_code)
##      CA      CO      FL      IL      MA      MD      NJ      NY   other      PA 
##     581      34      37      42     118      32      46     110     273      41 
##      TX unknown      VA      WA 
##      77     980      43      59
#Remove unnecessary columns (particularly the descriptions and single-level factors and variables that did not seem relevant for the aquisition response variable)
to_remove = c("company_id","normalized_name", "status", "logo_width", "logo_height")
startups_desc = startups[, to_remove]
startups = startups[, -which(colnames(startups) %in% to_remove)]

#Remove outlier in acquisition price
startups = startups[startups$aquisition_price_amount<1e12,]

#Split of data for cross-validation: 80% for training, 20$ for testing.
set.seed(420)
train_index = sample(1:nrow(startups), size = floor(0.8 * nrow(startups)))
startups_tr = startups[train_index,]
startups_te = startups[-train_index,]

The final dataset for analysis includes the following predictors.

str(startups)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2472 obs. of  16 variables:
##  $ category_code          : Factor w/ 13 levels "advertising",..: 4 13 4 5 8 5 3 13 1 8 ...
##  $ country_code           : Factor w/ 10 levels "AUS","CAN","CHN",..: 10 10 10 10 10 10 10 10 7 8 ...
##  $ state_code             : Factor w/ 14 levels "CA","CO","FL",..: 1 1 1 11 9 1 1 1 12 12 ...
##  $ city                   : Factor w/ 17 levels "austin","cambridge",..: 8 6 12 8 8 8 12 6 8 8 ...
##  $ investment_rounds      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ invested_companies     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ funding_rounds         : num  0 1 1 1 3 1 5 2 0 0 ...
##  $ funding_total_usd      : num  0 5000000 3000000 274999 15286415 ...
##  $ milestones             : num  0 3 2 0 1 3 3 3 0 2 ...
##  $ relationships          : num  6 14 6 4 2 7 38 2 2 59 ...
##  $ aquisition_price_amount: num  2.00e+07 4.75e+07 1.50e+07 2.62e+08 3.65e+07 ...
##  $ top25pct               : num  0 64 0 100 0 0 55 100 0 20 ...
##  $ top50pct               : num  0 18 0 0 0 0 0 0 0 7 ...
##  $ istop25                : num  0 1 0 1 0 0 1 1 0 1 ...
##  $ istop26_50             : num  0 1 0 0 0 0 0 0 0 1 ...
##  $ name_length            : num  7 10 7 8 9 9 6 8 9 8 ...

Diagnostics and Performance Function

In order to compare models, we use a single function that runs tests on both:

  • Model assumptions: Residual vs fitted and Q-Q plots, bptest and wilk-shapiro test.
  • Model performance: Adjusted r-squared, AIC, BIC, leave one out cross-validation, and regression p-value.

These diagnostics functions are further used in another set of utility functions that provide a side-by-side view of the model performance across the model assumptions, performance, model parameters and their caracteristics.

library(lmtest)
library(MASS)
library(faraway)

get_outliers = function(model){
  stdresid = rstandard(model)
  outliers = abs(stdresid) > 2
  stats = c(
    num_outliers = sum(outliers),
    pct_outliers = mean(outliers)
  )
  
  list(stats = stats, outliers = outliers)
}

get_influentials = function(model){
  cooks = cooks.distance(model)
  influential = cooks > (4 / length(cooks))
  
  stats = c(
    num_influentials = sum(influential),
    pct_influentials = mean(influential)
  )

  list(stats = stats, influentials = influential)
}

get_rmse = function(model, predict_function = NULL, test_actuals = NULL){
  predictions = 0
  if(!is.null(predict_function))
    predictions = predict_function(model)

  rsme_val = 0
    
  if(!is.null(test_actuals)) 
    rsme_val = sqrt(mean((test_actuals - predictions) ^ 2)) 
  else
    rsme_val = sqrt(mean(resid(model) ^ 2))
      
  rsme_val
}

# Diagnostic function that performs multiple hypothess tests for the model significance, whether the linear model assumptions are valid, and other model performance metrics 
diagnostics_performance = function(model, pcol = "grey",  lcol = "dodgerblue", plotit = TRUE,  testit = TRUE, response_transform = NULL, train_actuals = NULL, as_data_frame=TRUE) {
  #Plot functionality
  if (plotit) {
    par(mfrow = c(1, 2))
    
    #Fitted vs residuals plot
    plot(resid(model) ~ fitted(model), 
         col = pcol, pch = 20, 
         main = "Fitted vs Residuals Plot", xlab = "Fitted", ylab = "Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    
    #Q-Q plot
    qqnorm(resid(model), col = pcol, main = "Normal Q-Q Plot")
    qqline(resid(model),  lty = 2, lwd = 2, col = lcol)
  }
  
  #Test functionality
  if (testit) {
    resids = resid(model)
    if(!is.null(response_transform) && !is.null(train_actuals)){
      fits = response_transform(fitted(model))
      resids = train_actuals-fits
    }
    shapiro_test_pval = shapiro.test(resids)$p.value 
    bp_test_pval = unname(bptest(model)$p.value)
    rmse_loocv = sqrt(mean((resids / (1 - hatvalues(model))) ^ 2))
    r_squared = summary(model)$r.squared
    adj_r_squared = summary(model)$adj.r.squared
    aic = extractAIC(model)[2]
    bic =  extractAIC(model, k = log(n))[2]
    outliers = unname(get_outliers(model)$stats)
    influentials = unname(get_influentials(model)$stats)
    model_pval = unname(pf(summary(model)$fstatistic[1],summary(model)$fstatistic[2],summary(model)$fstatistic[3], lower.tail=FALSE))
    rmse_train = sqrt(mean(resids ^ 2))

  
    #Output from test
    res = list(
      "coefficients" =  nrow(summary(model)$coefficients),
      "model.pval" = model_pval,
      "r2" =  r_squared,
      "rmse.train" = rmse_train,
      "rmse.test" = NA,
      "bptest.pval" =  bp_test_pval,
      "shapiro.pval" =  shapiro_test_pval,
      "AIC" =  aic,
      "BIC" =  bic,
      "adjR2" =  adj_r_squared,
      "rsme_loocv" =  rmse_loocv,
      "outliers.num" = outliers[1],
      "outliers.pct" = outliers[2],
      "influential.num" = influentials[1],
      "influential.pct" = influentials[2])
    
    if(as_data_frame)
      res = data.frame(unlist(res))
    colnames(res) = NULL
    res
  }
}


#Shows the variable infation factors and their variables for the parameters with high variability due to colinearity.
get_vifs = function(model){
  vifs = faraway::vif(model)
  vifs = data.frame(coeficient =names(vifs[vifs>5]), vif = unname(vifs[vifs>5]))
  vifs = vifs[order(vifs$vif, decreasing=TRUE),]
  vifs
}

#Compares  two or more models side-by-side in a single dataframe, all model assumptions and prformance.
compare_model_performance = function(models, predict_transforms = NULL, test_data, test_actuals_col, train_actuals = NULL, as_data_frame = FALSE){
  dummy = unlist(diagnostics_performance(models[[1]], plotit = FALSE, as_data_frame = FALSE))
  model_comp = matrix(0, nrow = length(dummy), ncol = length(models))
  #col_names = rep(0, length(models))
  train_actuals = unlist(train_actuals)
  
  for(m in 1:length(models)){
    model = models[[m]]
    model_comp[,m] = unlist(diagnostics_performance(model, plotit = FALSE, response_transform = predict_transforms[[m]], train_actuals = train_actuals, as_data_frame = FALSE))

    transform_func = function(x){x}
    if(!is.null(predict_transforms))
      transform_func = predict_transforms[[m]]
    
    predict_func = custom_predict(model, test_data = test_data, fitted_transform = transform_func)
    actuals = unlist(test_data[,test_actuals_col])
    
    rmse_test = get_rmse(model, predict_function = predict_func, test_actuals = actuals)
    model_comp[names(dummy) == "rmse.test", m] = rmse_test #
  }
  rownames(model_comp) = names(dummy)
  model_comp
}


#Helper function to allow custom response transforamtions, relevant to compare models.
custom_predict = function(model, test_data, fitted_transform = NULL){
  fitted.transform = ifelse(is.null(fitted_transform), function(x){x}, fitted_transform)
  function(model){
    fitted_transform(predict(model, test_data))
  }
}

#View model diagnostics with and without influential points.
compare_without_influential = function(model, train_data, test_data, test_actuals_col, response_transform = NULL){
  formula = as.character(model$call)[2]
  influentials = get_influentials(model)$influentials
  new_model = lm(as.formula(formula), data = train_data, subset = !influentials)
  
  compare_coef = compare_model_performance(list(model, new_model), predict_transforms = list(response_transform, response_transform), 
                                           test_data = test_data, test_actuals_col = test_actuals_col,  train_actuals = train_data[,test_actuals_col])
  
  colnames(compare_coef) = c("W/ Influential", "W/o Influential")
  compare_coef
}


#List all coefficients used by all models, along with a view of coefficient values
compare_parameters = function(models,  variables, model_names = 0){
  all_coefs = data.frame(coefficients = rownames(summary(models[[1]])$coefficients))
  col_names = c("coefficients")
  
  for(m in 1:length(models)){
    model = models[[m]]
    model_name = ifelse(model_names == 0, m, model_names[m])
    
    new_items = data.frame(coefficients = rownames(summary(model)$coefficients))
    for(v in 1:length(variables)){
      new_items = cbind(new_items,
                        variable = summary(model)$coefficients[,variables[v]])
      
      colnames(new_items)[v+1] =  paste(model_name,"_",v)
    }
    all_coefs = merge(all_coefs, new_items, by = "coefficients", all = TRUE)
  }
  all_coefs
}

Model building

Additive Models

  • All variables

The first step of the exploration is to look at a simple additive model that contains all variables.

#Training the model
additive_full = lm(aquisition_price_amount ~ . , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(additive_full)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 6.200000e+01
model.pval 0.000000e+00
r2 1.600000e-01
rmse.train 1.685398e+09
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.412779e+04
BIC 8.448821e+04
adjR2 1.300000e-01
rsme_loocv 1.767236e+09
outliers.num 3.900000e+01
outliers.pct 2.000000e-02
influential.num 5.400000e+01
influential.pct 3.000000e-02
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(additive_full), digits = 2)
coeficient vif
14 invested_companies 49.28
13 investment_rounds 49.22
7 country_codeUSA 46.53
10 cityother 45.72
12 cityunknown 44.11
6 country_codeunknown 28.09
8 state_codeunknown 25.73
9 citynew york 9.44
11 citysan francisco 8.59
5 country_codeother 7.07
3 category_codeunknown 6.84
4 country_codeGBR 5.66
2 category_codesoftware 5.36
1 category_codeother 5.10

We can see that model assumptions are clearly violated. We also see that there are variables with a potentially large variance inflation factor. For example, it would make sense to only use investment_rounds or invested_companies, or country or state, because there can be collinearity as discussed in the data exploration section. 1.97% and 2.73% of points are considered to be outliers and to have influence respectevely. Although we can aim to reduce these, in this first exploration of different additive models it does not seem likely these values will change much.

#Direct correlation between investment_rounds and invested_companies is close to one
cor(startups_tr$investment_rounds,startups_tr$invested_companies)
## [1] 0.9889835
#The partial correlation coefficient is also very close to zero.
#The variation of invested_companies that is unexplained by remaining predictors shows little correlation
#with the variation of aquisition_price_amount that is not explained remaining predictors.
#Adding invested companies is of little use to the model.
model_test_invested_companies_1 = lm(invested_companies ~ .-aquisition_price_amount , data = startups_tr)
model_test_invested_companies_2 = lm(aquisition_price_amount ~ .-invested_companies , data = startups_tr)
cor(resid(model_test_invested_companies_1), resid(model_test_invested_companies_2))
## [1] 0.1080731

We remove the invested_companies and state_code from the predictors and train a new model.

  • Improving collinearity
#Training the model
additive_full_updated = lm(aquisition_price_amount ~ .-investment_rounds -state_code , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(additive_full_updated)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 4.800000e+01
model.pval 0.000000e+00
r2 1.300000e-01
rmse.train 1.709368e+09
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.415563e+04
BIC 8.443466e+04
adjR2 1.100000e-01
rsme_loocv 1.764967e+09
outliers.num 3.500000e+01
outliers.pct 2.000000e-02
influential.num 4.400000e+01
influential.pct 2.000000e-02
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(additive_full_updated), digits = 2)
coeficient vif
9 cityother 32.01
11 cityunknown 30.18
7 country_codeUSA 25.38
6 country_codeunknown 24.98
5 country_codeother 7.07
3 category_codeunknown 6.81
10 citysan francisco 5.95
4 country_codeGBR 5.65
2 category_codesoftware 5.35
1 category_codeother 5.08
8 citynew york 5.03

In the plots, we can see that normality and constant variance are still suspect. We also see confirm that normality suspect due to the low shapiro-wilk test p-value, and that the constant variance is suspect due to low bp test p-value. The R squared and adjusted R squared remain somewhat similar, as well as the RMSE leave one out cross validation. But we now have less variables with a potentially large collinearity. Outlier and influential points remain somewhat similar.

  • Backwards AIC

Now we will perform a backwards AIC search to optimize for AIC and further decrease collinearity issues.

#Finding the model
model_additive_selected_AIC = step(additive_full_updated, trace = 0)

#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(na.omit(model_additive_selected_AIC)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 1.800000e+01
model.pval 0.000000e+00
r2 1.200000e-01
rmse.train 1.719315e+09
rmse.test NA
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.411857e+04
BIC 8.422321e+04
adjR2 1.200000e-01
rsme_loocv 1.757006e+09
outliers.num 3.900000e+01
outliers.pct 2.000000e-02
influential.num 5.000000e+01
influential.pct 3.000000e-02
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(model_additive_selected_AIC), digits = 2)
coeficient vif
2 category_codeunknown 5.96
1 category_codesoftware 5.16

The found model still violates normality and constant variance assumptions, as would normally be expected since this is derived from a model that also violated them. However, we are able to see less predictors with a variance inflation factor higher than 5. Outlier and influential points remain similar. Now we proceed to compare both models with an anova test and decide which is the best additive model.

  • Selecting the best additive model
anova(model_additive_selected_AIC, additive_full_updated)
## Analysis of Variance Table
## 
## Model 1: aquisition_price_amount ~ category_code + invested_companies + 
##     funding_rounds + relationships + top50pct + istop26_50
## Model 2: aquisition_price_amount ~ (category_code + country_code + state_code + 
##     city + investment_rounds + invested_companies + funding_rounds + 
##     funding_total_usd + milestones + relationships + top25pct + 
##     top50pct + istop25 + istop26_50 + name_length) - investment_rounds - 
##     state_code
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1959 5.8441e+21                            
## 2   1929 5.7767e+21 30 6.7425e+19 0.7505 0.8334

We can see that the larger model doesn’t offer a statistically significant difference in explaining the relationship of predictors and response. For that reason, from the additive models, we prefer the model found through the backwards BIC method model_additive_selected_AIC. This model still violates normality and constant variance assumptions, and these will not likely be corrected or at least further improved until we look at more complex interaction models or we apply transformations to predictor or response variables (done below).

All in all, we won’t be focusing on this purely additive model model_additive_selected_AIC because of the very low r squared and potential low power to make predictions.

Interaction Models

  • Two-way interactions

We started the exploration by training a 2-way interaction model, excluding investment_rounds and state_code from the variables, due to collinearity.

We also tried to fit an overall 3-way model in a couple of ways. First, by considering all terms besides investment_rounds and state_code, but due to computational limitations we decided to take a different approach (after 3 hours the model was not complete training yet). We also tried to fit an overall 3-way model, excluding city from the interaction terms (and adding it as a separate term on its own) to try to improve machine performance–altough this model offered a good initial R squared (~0.9) it was hard to interpret and still taxing on machine resources.

#Training the model
interaction_model = lm(aquisition_price_amount ~ (. -investment_rounds -state_code)^2 , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(interaction_model))[1:9,], digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 651.00
model.pval 0.00
r2 0.74
rmse.train 938013733.45
bptest.pval 0.00
shapiro.pval 0.00
AIC 82988.77
BIC 86773.16
adjR2 0.61

Although this model mathematically has a good R squared and adjusted R squared, the normality and constant variance assumptions are still violated. Furthermore, because there are 906 predictors, this is not an easy to understand model. We will find a simplified model doing a backwards AIC search.

  • Forward AIC

Because the previous model has 906 predictors, we use a forward search to optimize for computational resources. We tried to follow use a BIC approach so that we optimize for reducing the number of predictors, but the resulting R squared was quite low (0.10), so we decided to continue with the AIC method.

#Finding the model
model_interaction_start = lm(aquisition_price_amount ~ 1, data = startups_tr)
model_interaction_selected = step(
  model_interaction_start,
  scope = ~ category_code * funding_rounds + category_code * milestones +
    category_code * aquisition_price_amount + category_code * top50pct + category_code *
    istop26_50 + category_code * country_code + category_code * city + category_code *
    invested_companies + category_code * funding_total_usd + category_code *
    relationships + category_code * top25pct + category_code * istop25 + category_code *
    name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
    funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
    country_code + funding_rounds * city + funding_rounds * invested_companies +
    funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
    top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
    aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
    milestones * country_code + milestones * city + milestones * invested_companies +
    milestones * funding_total_usd + milestones * relationships + milestones *
    top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
    top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
    country_code + aquisition_price_amount * city + aquisition_price_amount *
    invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
    relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
    istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
    top50pct * country_code + top50pct * city + top50pct * invested_companies +
    top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
    top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
    istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
    istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
    istop26_50 * name_length + country_code * city + country_code * invested_companies +
    country_code * funding_total_usd + country_code * relationships + country_code *
    top25pct + country_code * istop25 + country_code * name_length + city *
    invested_companies + city * funding_total_usd + city * relationships + city *
    top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
    invested_companies * relationships + invested_companies * top25pct + invested_companies *
    istop25 + invested_companies * name_length + funding_total_usd * relationships +
    funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
    name_length + relationships * top25pct + relationships * istop25 + relationships *
    name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length,
  direction = "forward",
  trace = 0
)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_interaction_selected)[1:9,], digits = 2), caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 8.700000e+01
model.pval 0.000000e+00
r2 5.475401e-01
rmse.train 1.235132e+09
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.294879e+04
BIC 8.345454e+04

Although this model optimizes for AIC (it has a lower AIC than the first two-way interaction model), this model performs worse in other areas. In particular the both the R squared and adjusted R squared decreased (0.55 and 0.53 respectively). The normality and constant variance assumptions continue to be violated.

  • Selecting the best interaction model

Although we prefer the first two-way interaction model due to its better performance, we run an anova test to check the significance of the additional parameters included on the first two-way model vs the model found via forward AIC.

anova(model_interaction_selected, interaction_model)
## Analysis of Variance Table
## 
## Model 1: aquisition_price_amount ~ relationships + category_code + funding_rounds + 
##     istop26_50 + top50pct + invested_companies + funding_total_usd + 
##     milestones + relationships:category_code + category_code:funding_rounds + 
##     funding_rounds:istop26_50 + relationships:top50pct + category_code:invested_companies + 
##     category_code:istop26_50 + category_code:top50pct + relationships:invested_companies + 
##     funding_rounds:invested_companies + top50pct:invested_companies + 
##     relationships:istop26_50 + top50pct:milestones
## Model 2: aquisition_price_amount ~ ((category_code + country_code + state_code + 
##     city + investment_rounds + invested_companies + funding_rounds + 
##     funding_total_usd + milestones + relationships + top25pct + 
##     top50pct + istop25 + istop26_50 + name_length) - investment_rounds - 
##     state_code)^2
##   Res.Df        RSS  Df  Sum of Sq      F    Pr(>F)    
## 1   1890 3.0160e+21                                    
## 2   1326 1.7395e+21 564 1.2765e+21 1.7253 1.137e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova test confirms that the initial model does include predictors that are significant. The preferred interaction model is interaction_model.

Transformation Models

  • Boxcox method

We start with using the same predictors used in the first 2-way interaction model. We use a \(\lambda\) value of 0.12. Also, we apply a logarithmic transformation in the predictors funding_total_usd and relationships (+1 is added to these predictors because there are records with the value 0 which would lead to issues when applying the log transformation), as this showed an improvement in decreasing adjusted r squared and AIC.

#Initial model, before transformation
model_transformation_boxcox_initial = lm(aquisition_price_amount ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)

#Boxcox plot
boxcox(model_transformation_boxcox_initial, plotit = TRUE, lambda = seq(0.1, 0.14, by = 0.01))

#Training the model
model_transformation_boxcox = lm((((aquisition_price_amount ^ 0.12) - 1) / 0.12) ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)

#Define the transformation functions to revert back to the units of aquisition_price
boxcox_reverse_function = function(lambda = 0.12){
  function(val){
    (val*lambda + 1)^(1/lambda)
  }
}


#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_boxcox, response_transform = boxcox_reverse_function(lambda = 0.12), train_actuals = startups_tr$aquisition_price_amount)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 6.530000e+02
model.pval 0.000000e+00
r2 6.100000e-01
rmse.train 1.106672e+09
bptest.pval 1.000000e+00
shapiro.pval 0.000000e+00
AIC 1.165783e+04
BIC 1.545384e+04
adjR2 4.100000e-01

Although in this model we are seeing lower r squared and adjusted r squared values than we obtained with the two-way interaction model, now the constant variance assumption has improved (we can see this both in the fitted vs residuals plot, and in the result of the bp test, which provides a p-value of 0.9997102). Normality is still suspect.

  • Forward AIC, using boxcox model as the scope

We use the previously trained model as the full scope for a forward AIC search.

#Finding the model
model_transformation_boxcox_start = lm(((aquisition_price_amount ^ 0.12 - 1)/ 0.12) ~ 1, data = startups_tr)
model_transformation_boxcox_selected = step(
  model_transformation_boxcox_start,
  scope =
    ((aquisition_price_amount ^ 0.12 - 1)/ 0.12) ~ category_code * funding_rounds + category_code * milestones +
    category_code * aquisition_price_amount + category_code * top50pct + category_code *
    istop26_50 + category_code * country_code + category_code * city + category_code *
    invested_companies + category_code * funding_total_usd + category_code *
    relationships + category_code * top25pct + category_code * istop25 + category_code *
    name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
    funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
    country_code + funding_rounds * city + funding_rounds * invested_companies +
    funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
    top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
    aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
    milestones * country_code + milestones * city + milestones * invested_companies +
    milestones * funding_total_usd + milestones * relationships + milestones *
    top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
    top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
    country_code + aquisition_price_amount * city + aquisition_price_amount *
    invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
    relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
    istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
    top50pct * country_code + top50pct * city + top50pct * invested_companies +
    top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
    top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
    istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
    istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
    istop26_50 * name_length + country_code * city + country_code * invested_companies +
    country_code * funding_total_usd + country_code * relationships + country_code *
    top25pct + country_code * istop25 + country_code * name_length + city *
    invested_companies + city * funding_total_usd + city * relationships + city *
    top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
    invested_companies * relationships + invested_companies * top25pct + invested_companies *
    istop25 + invested_companies * name_length + funding_total_usd * relationships +
    funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
    name_length + relationships * top25pct + relationships * istop25 + relationships *
    name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length +
    log(funding_total_usd + 1) + log(relationships + 1),
  direction = "forward",
  trace = 0
)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_boxcox_selected, response_transform = boxcox_reverse_function(lambda = 0.12), train_actuals = startups_tr$aquisition_price_amount)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 9.600000e+01
model.pval 0.000000e+00
r2 6.500000e-01
rmse.train 1.376203e+10
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 1.029215e+04
BIC 1.085022e+04
adjR2 6.300000e-01
rsme_loocv 4.252724e+10
outliers.num 1.040000e+02
outliers.pct 5.000000e-02
influential.num 8.800000e+01
influential.pct 4.000000e-02

The model shows a significant improvement in performance (adjusted r squared, AIC, BIC) but the constant variance assumption is violated again. For this reason, we still prefer the original boxcox model model_transformation_boxcox.

We also try to remove influential points, just to compare what would happen with the model performance and assumptions.

## Evaluate whether removing influential observations helps on the model assumption and fit.
knitr::kable(compare_without_influential(model_transformation_boxcox_selected, train_data = startups_tr, test_data = startups_te, test_actuals_col = "aquisition_price_amount", response_transform = boxcox_reverse_function(lambda = 0.12)), digits = 2)
## Warning in train_actuals - fits: longer object length is not a multiple of
## shorter object length
## Warning in resids/(1 - hatvalues(model)): longer object length is not a multiple
## of shorter object length
W/ Influential W/o Influential
coefficients 9.600000e+01 9.600000e+01
model.pval 0.000000e+00 0.000000e+00
r2 6.500000e-01 7.000000e-01
rmse.train 1.376203e+10 5.323183e+09
rmse.test 3.988340e+11 3.855839e+12
bptest.pval 0.000000e+00 0.000000e+00
shapiro.pval 0.000000e+00 0.000000e+00
AIC 1.029215e+04 9.260600e+03
BIC 1.085022e+04 9.818670e+03
adjR2 6.300000e-01 6.900000e-01
rsme_loocv 4.252724e+10 1.178727e+10
outliers.num 1.040000e+02 1.030000e+02
outliers.pct 5.000000e-02 5.000000e-02
influential.num 8.800000e+01 6.700000e+01
influential.pct 4.000000e-02 4.000000e-02

We can see that there are not significant changes in r squared, adjusted r squared, and in model assumptions (bptest and wilk-shapiro test). Test RMSE is actually worse for a theoretical model without influential values. All in all, we still prefer the original boxcox model model_transformation_boxcox.

  • Logarithmic transformation in response

We also explore using a logarithmic transformation in the response variable, using the same predictors originally used for the first model fitted with the boxcox method above.

#Training the model
model_transformation_log = lm(log(aquisition_price_amount) ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_log, response_transform = exp, train_actuals = startups_tr$aquisition_price_amount)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 6.530000e+02
model.pval 0.000000e+00
r2 6.000000e-01
rmse.train 1.212812e+09
bptest.pval 1.000000e+00
shapiro.pval 0.000000e+00
AIC 3.930550e+03
BIC 7.726560e+03
adjR2 4.100000e-01
rsme_loocv Inf

In R squared and in predicted R squared this model performs marginally worse than the boxcox model, and the constant variance assumption doesn’t seem suspect from the bp test (although we can see it’s not fully met either from the fitted vs residuals plot). However, in AIC and in BIC, this model shows a more significant drop in performance (drop from 3274 to 3931 in AIC and from 7070 to 7727 in BIC). For that reason we prefer the boxcox model model_transformation_boxcox as a transformation-based model

  • Forward AIC, using lorarigthmic model as the scope

We also use the previously trained log model as the full scope for a forward AIC search.

#Finding the model
model_transformation_log_start = lm(log(aquisition_price_amount) ~ 1, data = startups_tr)
model_transformation_log_selected = step(
  model_transformation_log_start,
  scope =
    log(aquisition_price_amount) ~ category_code * funding_rounds + category_code * milestones +
    category_code * aquisition_price_amount + category_code * top50pct + category_code *
    istop26_50 + category_code * country_code + category_code * city + category_code *
    invested_companies + category_code * funding_total_usd + category_code *
    relationships + category_code * top25pct + category_code * istop25 + category_code *
    name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
    funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
    country_code + funding_rounds * city + funding_rounds * invested_companies +
    funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
    top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
    aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
    milestones * country_code + milestones * city + milestones * invested_companies +
    milestones * funding_total_usd + milestones * relationships + milestones *
    top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
    top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
    country_code + aquisition_price_amount * city + aquisition_price_amount *
    invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
    relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
    istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
    top50pct * country_code + top50pct * city + top50pct * invested_companies +
    top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
    top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
    istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
    istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
    istop26_50 * name_length + country_code * city + country_code * invested_companies +
    country_code * funding_total_usd + country_code * relationships + country_code *
    top25pct + country_code * istop25 + country_code * name_length + city *
    invested_companies + city * funding_total_usd + city * relationships + city *
    top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
    invested_companies * relationships + invested_companies * top25pct + invested_companies *
    istop25 + invested_companies * name_length + funding_total_usd * relationships +
    funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
    name_length + relationships * top25pct + relationships * istop25 + relationships *
    name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length +
    log(funding_total_usd + 1) + log(relationships + 1),
  direction = "forward",
  trace = 0
)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_log_selected, response_transform = exp, train_actuals = startups_tr$aquisition_price_amount)), digits = 2, caption = "Performance & Assumptions")

Performance & Assumptions
coefficients 8.900000e+01
model.pval 0.000000e+00
r2 5.800000e-01
rmse.train 1.058620e+12
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 2.898790e+03
BIC 3.416160e+03
adjR2 5.600000e-01
rsme_loocv 3.014162e+12
outliers.num 1.020000e+02
outliers.pct 5.000000e-02
influential.num 7.300000e+01
influential.pct 4.000000e-02

Although this model shows improvements in AIC and BIC, other metrics do not look as good. And the constant variance assumption is more suspect now. For that reason the original logarithmic model model_transformation_log seems better, and overall, we still prefer the original boxcox model model_transformation_boxcox as a transformation-based model.

Summary

In summary, the two preferred models are: interaction_model (interaction), model_transformation_boxcox (transformation). We are omitting model_additive_selected_AIC (additive) because of its low r squared.

Results

#for interaction_model and model_additive_selected_AIC,no response transformation
simple_function = function(x) {
  x
}

comp = compare_model_performance(
  models = list(interaction_model, model_transformation_boxcox), 
  predict_transforms = list(simple_function, boxcox_reverse_function(lambda = 0.12)),
  test_data = startups_te, test_actuals_col = "aquisition_price_amount", train_actuals = startups_tr$aquisition_price_amount)
## Warning in predict.lm(model, test_data): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(model, test_data): prediction from a rank-deficient fit
## may be misleading
colnames(comp) = c("interaction_model", " model_transformation_boxcox")
comp = as.data.frame(t(as.matrix(comp)))
colnames(comp)
##  [1] "coefficients"    "model.pval"      "r2"              "rmse.train"     
##  [5] "rmse.test"       "bptest.pval"     "shapiro.pval"    "AIC"            
##  [9] "BIC"             "adjR2"           "rsme_loocv"      "outliers.num"   
## [13] "outliers.pct"    "influential.num" "influential.pct"
comp
##                              coefficients    model.pval        r2 rmse.train
## interaction_model                     651 2.889544e-160 0.7390411  938013733
##  model_transformation_boxcox          653  1.516075e-68 0.6051143 1106671506
##                                rmse.test bptest.pval shapiro.pval      AIC
## interaction_model            11378779931 0.000071699 8.984570e-55 82988.77
##  model_transformation_boxcox         NaN 0.999710167 1.092363e-66 11657.83
##                                   BIC     adjR2 rsme_loocv outliers.num
## interaction_model            86773.16 0.6111201        NaN           NA
##  model_transformation_boxcox 15453.84 0.4106540        NaN           NA
##                              outliers.pct influential.num influential.pct
## interaction_model                      NA              NA              NA
##  model_transformation_boxcox           NA              NA              NA
#knitr::kable(comp, digits = 2, caption = "Performance & Assumptions")

Discussion

Throughout this excercise we have seen that building a model is part art and part science. There are several aspects and trade-offs to be considered, including but not limited to: model assumptions, model prediction capabilities, how easy to understand the model is (inference), and actual computational power (some attempted routes were ultimately not successful due to computational limitations).

Due to the complex relationships between predictors and response in this dataset, settling on a simple additive model was not enough and we had to fit models with interaction terms and transformations that are not as intuitive to understand but that could improve the models prediction capabilities. Although both models have more than 650 predictors, we made an attempt at drawing inferential conclusions and validate if some of the variables one would intuitively associate with the acquisition price of a startup are actually significant.

Interaction model

The top most significant predictors in the interaction model are not very easy to understand. We can still see that predictors that look intuitively useful like invested_companies (number of companies that invested in the start up before) and istop25 (one of the founders came from a top 25 university) do appear in some of the top interaction terms.

#Most statistically significant terms in the interaction model
order_coeff_interaction = summary(interaction_model)$coefficients[order(summary(interaction_model)$coefficients[, 4]), ]
top_coeff_interaction = cbind(b_estimate = order_coeff_interaction[1:5,1], p_value = order_coeff_interaction[1:5,4])
row.names(top_coeff_interaction) = row.names(top_coeff_interaction[1:5,])
top_coeff_interaction
##                                                   b_estimate      p_value
## category_codepublic_relations:invested_companies 35514601304 4.410399e-24
## country_codeGBR:top25pct                          -237009393 1.137214e-09
## category_codepublic_relations:istop26_50          7294535816 1.026432e-08
## country_codeGBR:istop25                           9008423771 1.633846e-07
## istop25:istop26_50                               -3657667885 6.622019e-07

However, if we check interesting terms that seem useful (invested_companies, istop25, and funding_total_usd) independently we can see that they do not appear significant. And although funding_total_usd has a positive estimate (ie. the higher the funding in dollars, the higher the acquisition price), both invested_companies and istop25 have a negative estimate (the more companies invested on a startup the lesser the acquistion price, and having a founder from a top25 university also decreases acquisition price in this model).

#Test for significant of interesting predictors
#funding_total_usd
funding_total_estimate_int = summary(interaction_model)$coefficients[rownames(summary(interaction_model)$coefficients)=="funding_total_usd",c(1,4)]

#invested_companies
invested_companies_estimate_int = summary(interaction_model)$coefficients[rownames(summary(interaction_model)$coefficients)=="invested_companies",c(1,4)]

#istop25
istop25_estimate_int = 
summary(interaction_model)$coefficients[rownames(summary(interaction_model)$coefficients)=="istop25",c(1,4)]

results_int = rbind(funding_total_estimate_int, invested_companies_estimate_int, istop25_estimate_int)

#Results
knitr::kable(results_int)
Estimate Pr(>|t|)
funding_total_estimate_int 5.982652e+02 0.9667103
invested_companies_estimate_int -3.126975e+09 0.3365084
istop25_estimate_int -2.447240e+08 0.9789133

Boxcox model

Some of the most significant predictors in the boxcox model also consist of interactions that are not as easy to explain. However, we do see two predictors that are straightforward: log(relationships) and country_codeFRA, and they both have a positive estimate. This is roughly predicting that the more relationships the startup has built, and that if the startup is based in France, the higher the acquisition price.

#Most statistically significant terms in the boxcox model
order_coeff_boxcox = summary(model_transformation_boxcox)$coefficients[order(summary(model_transformation_boxcox)$coefficients[, 4]), ]
top_coeff_boxcox = cbind(b_estimate = order_coeff_boxcox[1:5,1], p_value = order_coeff_boxcox[1:5,4])
row.names(top_coeff_boxcox) = row.names(top_coeff_boxcox[1:5,])
top_coeff_boxcox
##                                 b_estimate      p_value
## log(relationships + 1)            9.325663 1.517758e-06
## istop25:istop26_50              -28.757234 7.809139e-03
## category_codemobile:citylondon  -97.119663 8.586193e-03
## country_codeCAN:citylondon     -151.202899 1.431657e-02
## country_codeFRA                 149.808561 1.590183e-02

On the flip side, interesting predictors (invested_companies, istop25, and funding_total_usd) continue to not offer a statistically significant explanation for startup acquisition price in this model. And both invested_companies and istop25 continue to have a negative estimate.

#Test for significant of interesting predictors
#funding_total_usd
funding_total_estimate_bc = summary(model_transformation_boxcox)$coefficients[rownames(summary(model_transformation_boxcox)$coefficients)=="funding_total_usd",c(1,4)]

#invested_companies
invested_companies_estimate_bc = summary(model_transformation_boxcox)$coefficients[rownames(summary(model_transformation_boxcox)$coefficients)=="invested_companies",c(1,4)]

#istop25
istop25_estimate_bc = 
summary(model_transformation_boxcox)$coefficients[rownames(summary(model_transformation_boxcox)$coefficients)=="istop25",c(1,4)]

results_bc = rbind(funding_total_estimate_bc, invested_companies_estimate_bc, istop25_estimate_bc)

#Results
knitr::kable(results_bc)
Estimate Pr(>|t|)
funding_total_estimate_bc -0.0002087 0.3199398
invested_companies_estimate_bc -37.9261650 0.4259821
istop25_estimate_bc -127.4439156 0.3469618

Summary

“It’s Difficult to Make Predictions, Especially About the Future”

What started with an idea to produce an easy to understand model to predict startup acquisition price resulted in a complex excercise. This involved data wrangling, data engineering and spotting wrong data inputs, building models while considering tradeofffs, and understanding and evaluating the performance of different models.

Predictors we thought could offer a good explanation for the acquisition price of a startup did not seem to be significant in the end. And interesting predictors, such as the startup country of origin (in particular, France), resulted having an impact that seems to be significant in explaning the startup price. We also saw the importance of having a well-balanced and complete data set. Through training the models and evaluating their performance we noticed that it is hard to make predictions if the data is not good enough or if we are making predictions that are an extrapolation.

As potential next steps, if this dataset was to be studied further, we propose to try to identify specific categorical variables that have the most impact. In our exploration, all categorical variables were considered, which made the resulting models quite complex. If there is a way to further reduce the number of categorial variables, perhaps the resulting models could be better for explanations and inferential analyses.

Appendix: Data Pre-cleaning

Data load

The startup data comes from Kaggle: [https://www.kaggle.com/justinas/startup-investments]

This dataset contains information about the startup ecosystem: organizations, individuals, company news, funding rounds, acquisitions, and IPOs, extracted originally from the Crunchabse Data website. In this project, only 6 of the 11 tables will be used and joined using unique IDs.

  • People Data Loading:
library(readr)

#People
people = read_csv("./people.csv")
colnames(people)[2] = "person_id"
attr(people, "spec") = NULL
colnames(people)
  • People Education Data Loading:
degrees = read_csv("./degrees.csv")
colnames(degrees)[2] = "person_id"
attr(degrees, "spec") = NULL
colnames(degrees)
  • Companies Data Loading:
#Company
companies = read_csv("./objects.csv")
attr(companies, "spec") = NULL
attr(companies, "problems") = NULL
colnames(companies)[1] = "company_id"
colnames(companies)
  • People/Company Relationship Data Loading:
comp_peep = read_csv("./relationships.csv")
colnames(comp_peep)[3] = "person_id"
colnames(comp_peep)[4] = "company_id"
colnames(comp_peep)
attr(comp_peep, "spec") = NULL
#str(comp_peep)
  • Company Aquisition Data Loading:
aquisitions = read_csv("./acquisitions.csv")
colnames(aquisitions)[4] = "company_id"
colnames(aquisitions)[6] = "aquisition_price_amount"
attr(aquisitions, "spec") = NULL
colnames(aquisitions)

#Filter out companies with aquisition amount not in USD (all other variable/predictors are in USD)
aquisitions = aquisitions[aquisitions$price_currency_code == "USD",c("company_id", "aquisition_price_amount")]

Data Preparation

After loading the individual tables, we need to join all into a single dataset that can be used for analysis.

Company Founders

The people, degrees and their relationship to the companies can all be joined into a sigle people dimension table. Additionally, we can extract potential relevant predictors for regression. For every person, generally an executive, we identify in each company people that attended to any of the top 25 or top 50 universities in the world (“top 50” excludes the top 25 universities). Also, we count the number of people in each company as well as the count of people by company that attended the top 25 and top 50 universities. Finally, to normalize these numbers, we also additional columns that represent percentage of people in each company that attended the top 25 and top 50 universities.

#Auxiliary functions to count rows for analysis of data
count_disticts = function(x){
  length(unique(x))
}
counts = function(x){
  length(x)
}

#Join people with their degrees information and with which company these people belong to.
full_peep = merge(people, degrees, by="person_id")
founders = merge(full_peep, comp_peep, by="person_id")

top1 = grep(".*University of Oxford.*|.*xford.*|.*California Institute of Technology.*|.*altech|.*University of Cambridge.*|.*ambridge.*|.*Stanford University.*|.*tanford.*|.*Massachusetts Institute of Technology.*|.*MIT.*|.*Princeton University.*|.*inceton.*|.*Harvard University.*| .*arvard#|.*Yale University.*|.*Yale.*|.*University of Chicago.*|.*Imperial College London.*|.*University of Pennsylvania.*|.*Johns Hopkins University.*|.*University of California Berkeley.*|.*Berkley.*|.*ETH Zurich.*|.*UCL.*|.*Columbia University.*|.*Columbia.*|.*University of California Los Angeles.*|.*University of Toronto.*|.*Cornell University.*|.*Cornell.*|.*Duke University.*|.*University of Michigan-Ann Arbor.*|.*Northwestern University.*|.*Tsinghua University.*|.*Peking University.*|.*National University of Singapore.*", ignore.case = TRUE, x = founders$institution)

top2 = grep(".*University of Washington.*|.*Carnegie Mellon University.*|.*Carnegie.*|.*London School of Economics and Political Science.*|.*LSE.*|.*New York University.*|.*University of Edinburgh.*|.*University of California San Diego.*|.*LMU Munich LMU.*|.*University of Melbourne.*|.*University of British Columbia.*|.*University of Hong Kong.*|.*King’s College London Kings College London.*|.*The University of Tokyo.*|.*École Polytechnique Fédérale de Lausanne.*|.*Lausanne.*|.*Georgia Institute of Technology.*|.*University of Texas at Austin.*|.*Karolinska Institute.*|.*McGill University.*|.*Technical University of Munich.*|.*Heidelberg University.*|.*KU Leuven.*|.*Paris Sciences et Lettres – PSL Research University Paris.*|.*PSL.*|.*The Hong Kong University of Science and Technology.*|.*University of Illinois at Urbana-Champaign.*|.*University of Illinois.*|.*Urbana-Champaign.*|.*Urbana.*|.*Champaign.*|.*Nanyang Technological University .*|.*Australian National University.*", ignore.case = TRUE, x = founders$institution)

top25 = rep(0, nrow(founders))
top50 = rep(0, nrow(founders))
top25[top1] = 1
top50[top2] = 1
founders["top25"] = top25
founders["top50"] = top50
founders = aggregate(x = founders[,c("top25", "top50")], by = list(company_id = founders$company_id, person_id = founders$person_id), FUN = max)
peeps = aggregate(x = founders[,c("company_id", "person_id")], by = list(company_id = founders$company_id), FUN = counts)
peeps = peeps[,c(1,2)]
colnames(peeps)[2] = "people"
founders = aggregate(x = founders[,c("top25", "top50")], by = list(company_id = founders$company_id), FUN = sum)
founders = merge(founders, peeps[,c(1,2)], by="company_id")
founders$top25pct = round(100*founders$top25/founders$people)
founders$top50pct = round(100*founders$top50/founders$people)
founders = na.omit(founders[,-c(2,3,4)])

Final Dataset For Analysis

After extracting the individual data aspects (people, company, investments) of the companies, these aspects are joined into a single dataset. The dataset that will be used for this project will depend on the resulting row numbers after different combination of joins.

company_aq = merge(companies, aquisitions, by = "company_id")
#nrow(company_aq)
company_aq_lds = merge(x = company_aq, y = founders, by = "company_id", all.x = TRUE)
#nrow(company_aq_lds)

Preminiary Analysis

To ensure that the dataset is viable for linear regression analysis, we create a preliminary linear model with company aquisition price as the response variable, and several other potential columns in the startups dataset as predictors. We are using the company_aq dataset above.

# Data set to be analized, filtering out aquisition values without aquisition price
startups = company_aq_lds[company_aq_lds$aquisition_price_amount>0, ]
nrow(startups)
col_remove = c("entity_type", "entity_id", "parent_id", "name", "permalink", "founded_at", "closed_at", "domain", "homepage_url", "twitter_username", "logo_url", "short_description", "description", "overview", "tag_list", "region", "first_investment_at", "last_investment_at", "first_funding_at", "last_funding_at", "first_milestone_at", "last_milestone_at", "created_by", "created_at", "updated_at")

startups = startups[,-which(colnames(startups) %in% col_remove)]

#Nulls by column
show_nas = function(somedf){
  cols_nas = data.frame(column = colnames(somedf), numNAs = rep(0, ncol(somedf)))
  for(i in 1:ncol(somedf)){
    cols_nas[i, "numNAs"] = length(somedf[is.na(somedf[,i]),i])
  }
  cols_nas$pctNAs = round(100*cols_nas$numNAs / nrow(somedf), digits = 1)
  cols_nas
}
show_nas(startups)

We clean NA values below:

#clean NAs; introduce "Unknown" for categorical variables where ther are N/As
startups[is.na(startups[,"category_code"]),"category_code"] = "unknown"
startups[is.na(startups[,"country_code"]),"country_code"] = "unknown"
startups[is.na(startups[,"city"]),"city"] = "unknown"
startups[is.na(startups[,"state_code"]),"state_code"] = "unknown"

#For numeric variablestop25pct, top50pct; will convert all of these into a categorical variable that says if people in company attended universities in top 25 universities in worl, then 1. Similarly, if company people (executives), attended universites in the top 26 to 50 universites `istop26_50`, 1; else, 0.
startups[is.na(startups[,"top25pct"]),"top25pct"] = 0
startups[is.na(startups[,"top50pct"]),"top50pct"] = 0
startups$istop25 = ifelse(startups$top25pct > 0, 1, 0)
startups$istop26_50 = ifelse(startups$top50pct > 0, 1, 0)

#Add a few additional predictors
startups$name_length = nchar(startups$normalized_name)

show_nas(startups)

With clean data, we can now perform the data split for training and data

#Generate the full data
write_csv(x = startups, path = "./startups.csv")

set.seed(420)
train_index = sample(1:nrow(startups), size = floor(0.8 * nrow(startups)))
startups_tr = startups[train_index,]
startups_te = startups[-train_index,]

#Save the train and test datasest
write_csv(x = startups_tr, path = "./startups_tr.csv")
write_csv(x = startups_te, path = "./startups_te.csv")

The dataset to be used, startups has 2472 rows of data.

A sample output of the data is below:

aquisition_price_amount city funding_total_usd category_code funding_rounds milestones relationships
20000000 other 0 games_video 0 0 6
47500000 mountain view 5000000 web 1 3 14
15000000 san francisco 3000000 games_video 1 2 6
262500000 other 274999 hardware 1 0 4
36500000 other 15286415 other 3 1 2
100000000 other 28000000 hardware 1 3 7

Moreover, a very simple model illustrates feasibility for linear regression analysis:

#Linear Model
company_model = lm(log(aquisition_price_amount) ~ log(funding_total_usd+1) + category_code + funding_rounds + milestones + relationships + istop25 + istop26_50, data=startups)
summary(company_model)
## 
## Call:
## lm(formula = log(aquisition_price_amount) ~ log(funding_total_usd + 
##     1) + category_code + funding_rounds + milestones + relationships + 
##     istop25 + istop26_50, data = startups)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3812  -1.1325   0.2108   1.6253   8.0076 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   16.574302   0.300837  55.094  < 2e-16 ***
## log(funding_total_usd + 1)     0.002897   0.011861   0.244 0.807083    
## category_codebiotech           1.422294   0.336273   4.230 2.43e-05 ***
## category_codeenterprise        0.240485   0.378627   0.635 0.525389    
## category_codegames_video      -0.010138   0.403346  -0.025 0.979949    
## category_codehardware          0.628013   0.417439   1.504 0.132596    
## category_codemobile            0.174453   0.378273   0.461 0.644708    
## category_codenetwork_hosting   0.306121   0.438431   0.698 0.485107    
## category_codeother             0.781903   0.319944   2.444 0.014601 *  
## category_codepublic_relations  0.778358   0.422489   1.842 0.065549 .  
## category_codesemiconductor     0.873731   0.428455   2.039 0.041531 *  
## category_codesoftware          0.420082   0.318961   1.317 0.187951    
## category_codeunknown          -1.909558   0.320396  -5.960 2.89e-09 ***
## category_codeweb              -0.207972   0.333049  -0.624 0.532391    
## funding_rounds                -0.108816   0.071172  -1.529 0.126414    
## milestones                     0.171153   0.059412   2.881 0.004001 ** 
## relationships                  0.024988   0.004549   5.494 4.35e-08 ***
## istop25                        1.202162   0.137580   8.738  < 2e-16 ***
## istop26_50                     0.627238   0.168145   3.730 0.000196 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.637 on 2453 degrees of freedom
## Multiple R-squared:  0.2632, Adjusted R-squared:  0.2578 
## F-statistic: 48.69 on 18 and 2453 DF,  p-value: < 2.2e-16
  • \(R^2= 0.263225\)
  • \(\text{p-value} = 5.9345821\times 10^{-148}\)

In addition, we can see how the linear regression assumptions of normality and constant variance are with the simple model. In the study, will better adjust tthe model in question to more appropriately describe the data with the model including its variance.

#Plots of the model
par(mfrow = c(1,2))
plot(company_model, which = 1:2)